Abstract. A standard approach in pattern classification is to estimate
the distributions of the label classes, and then to apply the Bayes
classifier to the estimates of the distributions in order to classify unlabeled
examples. As one might expect, the better our estimates of the label class
distributions, the better the resulting classifier will be. In this paper we
make this observation precise by identifying risk bounds of a classifier in
terms of the quality of the estimates of the label class distributions. We
show how PAC learnability relates to estimates of the distributions that
have a PAC guarantee on their $L_1$ distance from the true distribution,
and we bound the increase in negative log likelihood risk in terms of
PAC bounds on the KL-divergence. We give an inefficient but general-purpose
smoothing method for converting an estimated distribution that
is good under the $L_1$ metric into a distribution that is good under the
KL-divergence.
1 Introduction

We consider a general approach to pattern classification in which elements of 
each class are first used to train a probabilistic model via some unsupervised 
learning method. The resulting models for each class are then used to assign 
discriminant scores to an unlabeled instance, and a label is chosen to be the one 
associated with the model giving the highest score. For example [3] uses this 
approach to classify protein sequences, via training a well-known probabilistic 
suffix tree model of Ron et al. ^S] on each sequence class. Indeed, even where 
an unsupervised technique is mainly being used to gain insight into the process 
that generated two or more data sets, it is still sometimes instructive to try out 
the associated classifier, since the misclassification rate provides a quantitative 
measure of the accuracy of the estimated distributions. 
The work of ^B) has led to further related algorithms for learning classes of
probabilistic finite state automata (PDFAs) in which the objective of learning has 
been formalized as the estimation of a true underlying distribution (over strings 
output by the target PDFA) with a distribution represented by a hypothesis 
PDFA. The natural discriminant score to assign to a string is the probability
that the hypothesis would generate that string at random.

As one might expect, the better one's estimates of label class distributions 
(the class-conditional densities), the better should be the associated classifier. 
The contribution of this paper is to make precise that observation. We give 
bounds on the risk of the associated Bayes classifier 1 in terms of the quality of 
the estimated distributions. 

These results are partly motivated by our interest in the relative merits of 
estimating a class-conditional distribution using the variation distance, as op- 
posed to the KL-divergence (defined in the next section). In 0] it has been 
shown how to learn a class of PDFAs using KL-divergence, in time polynomial 
in a set of parameters that includes the expected length of strings output by 
the automaton. In ^1 we show how to learn this class with respect to variation 
distance, with a polynomial sample-size bound that is independent of the length 
of output strings. Furthermore, it can be shown that it is necessary to switch 
to the weaker criterion of variation distance, in order to achieve this. We show 
here that this leads to a different — but still useful — performance guarantee for 
the Bayes classifier. 

Abe and Warmuth [2] study the problem of learning probability distributions
using the KL-divergence, via classes of probabilistic automata. Their criterion
for learnability is that, for an unrestricted input distribution $D$, the hypothesis
PDFA should be almost (i.e. within $\epsilon$) as close as possible to $D$. Abe, Takeuchi
and Warmuth pQ study the negative log-likelihood loss function in the context 
of learning stochastic rules, i.e. rules that associate an element of the domain 
X to a probability distribution over the range Y . We show here that if two or 
more label class distributions are learnable in the sense of [3] , then the resulting 
stochastic rule (the conditional distribution over $Y$ given $x \in X$) is learnable in
the sense of pQ. 

We show that if instead the label class distributions are well estimated using
the variation distance, then the associated classifier may not have a good negative
log likelihood risk, but will have a misclassification rate that is close to optimal.
This result is for general $k$-class classification, where distributions may overlap
(i.e. the optimum misclassification rate may be positive). We also incorporate
variable misclassification penalties (sometimes one might wish a false positive to
cost more than a false negative), and show that this more general loss function
is still approximately minimized provided that discriminant likelihood scores are
rescaled appropriately.

1 The Bayes classifier associated with two or more probability distributions is the
function that maps an element $x$ of the domain to the label associated with the
probability distribution whose value at $x$ is largest. This is of course a well-known
approach for classification, see [7].
As a result we show that PAC-learnability, and more generally p-concept (footnote 2)
learnability [12], follows from the ability to learn class distributions in the setting
of Kearns et al. Papers such as [5, 14, 8] study the problem of learning various
classes of probability distributions with respect to KL-divergence and variation
distance, in this setting.

It is well-known (noted in that learnability with respect to KL-divergence 
is stronger than learnability with respect to variation distance. Furthermore, the 
KL-divergence is usually used (for example in |4ll()j ) due to the property that 
when minimized with respect to a sample, the empirical likelihood of that sample
is maximized. An algorithm that learns with respect to variation distance
can sometimes be converted to one that learns with respect to KL-divergence 
by a smoothing technique [5], when the domain is $\{0, 1\}^n$, and $n$ is a parameter
of the learning problem. In this paper we give a related smoothing rule that ap- 
plies to the version of the PDFA learning problem where we seem to "need" to 
use the variation distance. However, the smoothed distribution does not have an 
efficient representation, and requires the probabilities used in the target PDFA 
to have limited precision. 

1.1 Notation and Terminology 

In $k$-class classification, labeled examples are generated by distribution $D$ over
$X \times \{1, \ldots, k\}$. We consider the problem of predicting the label $\ell$ associated with
$x \in X$, where $x$ is generated by the marginal distribution of $D$ on $X$, $D|_X$. A
non-negative cost is incurred for each classification, based either on a cost matrix
(where the cost depends upon both the hypothesized label and the true label) or
the negative log-likelihood of the true label being assigned. The aim is to optimize
the expected cost given by the occurrence of a randomly generated example. We
refer to the expected cost associated with any classifier $f : X \to \{1, \ldots, k\}$ as
risk (as described by Vapnik JJj ), denoted as R(f). 

Let $D_\ell$ be $D$ restricted to points $(x, \ell)$, $\ell \in \{1, \ldots, k\}$. $D$ is a mixture
$\sum_{\ell=1}^{k} g_\ell D_\ell$, where $\sum_{i=1}^{k} g_i = 1$, and $g_i$ is the class prior of class $i$,
that is, the probability that a randomly generated data point has label $i$.

In Section 2 it is shown that if we have upper bounds on the inaccuracy
of the estimated distributions of each class label, then we can derive bounds
on the risk associated with the classifiers. Suppose $D$ and $D'$ are probability
distributions over the same domain $X$. We define the $L_1$ distance as

$$L_1(D, D') = \int_X |D(x) - D'(x)| \, dx.$$

We usually assume that $X$ is a discrete domain, in which case

$$L_1(D, D') = \sum_{x \in X} |D(x) - D'(x)|.$$

The KL-divergence from $D$ to $D'$ is defined as

$$I(D \| D') = \sum_{x \in X} D(x) \log\left(\frac{D(x)}{D'(x)}\right).$$
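As an illustration of the two measures just defined, the following minimal sketch computes them for discrete distributions. The dictionary representation, the function names and the choice of natural logarithm are assumptions made for this example only; they are not taken from the paper.

```python
import math

def l1_distance(d, d_prime):
    """L1 distance between two discrete distributions given as dicts x -> probability."""
    support = set(d) | set(d_prime)
    return sum(abs(d.get(x, 0.0) - d_prime.get(x, 0.0)) for x in support)

def kl_divergence(d, d_prime):
    """KL-divergence I(D || D') in nats; infinite if D' has no mass where D does."""
    total = 0.0
    for x, p in d.items():
        if p == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        q = d_prime.get(x, 0.0)
        if q == 0.0:
            return math.inf
        total += p * math.log(p / q)
    return total

# Illustrative distributions over a two-element domain (hypothetical values).
D = {"a": 0.5, "b": 0.5}
D_prime = {"a": 0.75, "b": 0.25}
l1_value = l1_distance(D, D_prime)
kl_value = kl_divergence(D, D_prime)
```

Note that the KL-divergence becomes infinite as soon as the estimate assigns zero probability to a point that the true distribution can generate, whereas the $L_1$ distance stays bounded by 2. This asymmetry is what the smoothing method of Section 2.4 addresses.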

2 p-concepts are functions probabilistically mapping elements of the domain to 2 classes.
1.2 Learning Framework

In the PAC-learning framework an algorithm receives labeled samples generated
independently according to distribution $D$ over $X$, where distribution $D$ is
unknown, and where labels are generated by an unknown function $f$ from a known
class of functions $\mathcal{F}$. The algorithm must output a hypothesis $h$ from a class of
hypotheses $\mathcal{H}$, such that with probability at least $1 - \delta$, $err_h \leq \epsilon$, where $\epsilon$ and $\delta$
are parameters. Notice that in this setting, if $f \in \mathcal{H}$, then $err^* = 0$, where $err^*$
is the error associated with the optimal hypothesis.

We use a variation on the framework used in [12] for learning p-concepts,
which adopts performance measures from the PAC model, extending this to learn
stochastic rules with $k$ classes. Therefore it is the case that $err^* = \inf_{h \in \mathcal{H}} \{err_h\}$.
The aim of the learning algorithm in this framework is to output a hypothesis
$h \in \mathcal{H}$ such that with probability of at least $1 - \delta$, the error $err_h$ of $h$ satisfies
$err_h \leq err^* + \epsilon$.

Our notion of learning distributions is similar to that of Kearns et al. [11].

Definition 1. Let $\mathcal{D}_n$ be a class of distributions. $\mathcal{D}_n$ is said to be efficiently
learnable if an algorithm $A$ exists, such that given $\epsilon > 0$ and $\delta > 0$ and access
to randomly drawn examples (see below) from any unknown target distribution
$D \in \mathcal{D}_n$, $A$ runs in time polynomial in $1/\epsilon$, $1/\delta$ and $n$ and returns a probability
distribution $D'$ that with probability at least $1 - \delta$ is within $L_1$-distance
(alternatively KL-divergence) $\epsilon$ of $D$.

We define p-concepts as introduced by Kearns and Schapire [12]. This definition
is for 2-class classification, but generalizes in a natural way to more than 2
classes.

Definition 2. A Probabilistic Concept (or p-concept) $f$ on domain $X$ is given
by a real-valued function $p_f : X \to [0, 1]$. An observation of $f$ consists of some
$x \in X$ together with a 0/1 label $\ell$ with $\Pr(\ell = 1) = p_f(x)$.
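Definition 2 can be made concrete with a short sketch that draws observations of a p-concept. The function name `observe`, the particular `p_f` and the seeded generator are hypothetical choices for illustration, not part of the original definition.

```python
import random

def observe(p_f, x, rng):
    """Return an observation (x, label) with Pr(label = 1) = p_f(x)."""
    label = 1 if rng.random() < p_f(x) else 0
    return x, label

# A hypothetical p-concept on the real line: label 1 is likely for x >= 0.
def p_f(x):
    return 0.9 if x >= 0 else 0.1

rng = random.Random(0)
observations = [observe(p_f, 1.0, rng) for _ in range(10000)]
frequency = sum(label for _, label in observations) / len(observations)
```

With many draws at a fixed point, the empirical frequency of label 1 should be close to `p_f(x)`.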

2 Results 

In Section 2.1 we give bounds on the risk associated with a hypothesis, with
respect to the accuracy of the approximation of the underlying distribution
generating the instances. In Section 2.2 we show that these bounds are close to
optimal, and in Section 2.3 we give corollaries showing what these bounds mean
for PAC learnability.

We define the accuracy of an approximate distribution in terms of $L_1$ distance
and KL-divergence, both of which are commonly used measurements. It
is assumed that the class priors of each class label are known.

2.1 Bounds on Increase in Risk 

First we examine the case where the accuracy of the hypothesis distribution is
such that the distribution for each class label is within $L_1$ distance $\epsilon$ of the true
distribution for that label, for some $0 < \epsilon < 1$. A cost matrix $C$ specifies the
cost associated with any classification, where the cost of classifying a data point
which has label $i$ as some label $j$ is denoted as $c_{ij}$ (where $c_{ij} \geq 0$). It is usually
the case that $c_{ij} = 0$ for $i = j$. We introduce the following notation:

Given classifier $f$ over discrete domain $X$, $f : X \to \{1, \ldots, k\}$, the risk of $f$ is
given by

$$R(f) = \sum_{x \in X} \sum_{i=1}^{k} c_{i f(x)} \, g_i \, D_i(x).$$

Let $f^*$ be the Bayes optimal classifier, i.e. the function with the minimal risk,
or optimal expected cost, and let $f'$ be the function with optimal expected cost
with respect to alternative distributions $D'_i$, $i \in \{1, \ldots, k\}$. For $x \in X$,

$$f^*(x) = \arg\min_j \sum_{i=1}^{k} c_{ij} \, g_i \, D_i(x),$$

$$f'(x) = \arg\min_j \sum_{i=1}^{k} c_{ij} \, g_i \, D'_i(x).$$
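The cost-sensitive Bayes classifier and its risk can be sketched as follows. This is only an illustration under assumptions: labels are numbered from 0 rather than 1, distributions are dictionaries over a small finite domain, and all names and numeric values are hypothetical.

```python
def bayes_classifier(x, priors, dists, cost):
    """Return the label j minimising sum_i cost[i][j] * priors[i] * dists[i][x]."""
    k = len(priors)

    def expected_cost(j):
        return sum(cost[i][j] * priors[i] * dists[i].get(x, 0.0) for i in range(k))

    return min(range(k), key=expected_cost)

def risk(classifier, priors, dists, cost, domain):
    """R(f) = sum_x sum_i cost[i][f(x)] * g_i * D_i(x), using the true distributions."""
    k = len(priors)
    return sum(cost[i][classifier(x)] * priors[i] * dists[i].get(x, 0.0)
               for x in domain for i in range(k))

# Hypothetical two-class problem with 0/1 misclassification cost.
domain = ["x0", "x1"]
priors = [0.5, 0.5]
dists = [{"x0": 0.8, "x1": 0.2}, {"x0": 0.3, "x1": 0.7}]
cost = [[0.0, 1.0], [1.0, 0.0]]  # cost[i][j]: true label i, predicted label j

def f_star(x):
    return bayes_classifier(x, priors, dists, cost)

optimal_risk = risk(f_star, priors, dists, cost, domain)
```

Replacing `dists` by estimated distributions in the call to `bayes_classifier`, while still evaluating `risk` on the true ones, gives the risk of $f'$.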

Theorem 1 (footnote 3). Let $f^*$ be the Bayes optimal classifier and let $f'$ be the Bayes
classifier associated with estimated distributions $D'_i$. Suppose that for each label
$i \in \{1, \ldots, k\}$, $L_1(D_i, D'_i) \leq \epsilon / g_i$. Then $R(f') \leq R(f^*) + \epsilon \cdot k \cdot \max_{ij}\{c_{ij}\}$.

Proof. Let $R_f(x)$ be the contribution from $x \in X$ towards the total expected
cost associated with classifier $f$. For $f$ such that $f(x) = j$,

$$R_f(x) = \sum_{i=1}^{k} c_{ij} \, g_i \, D_i(x).$$

Let $r_{\ell' - \ell}(x)$ be the increase in risk for labelling $x$ as $\ell'$ instead of $\ell$, so that

$$r_{\ell' - \ell}(x) = \sum_{i=1}^{k} c_{i\ell'} \, g_i \, D_i(x) - \sum_{i=1}^{k} c_{i\ell} \, g_i \, D_i(x) = \sum_{i=1}^{k} (c_{i\ell'} - c_{i\ell}) \, g_i \, D_i(x). \qquad (1)$$

Note that due to the optimality of $f^*$ on the $D_i$, for all $x \in X$: $r_{f'(x) - f^*(x)}(x) \geq 0$.
In a similar way, the expected contribution to the total cost of $f'$ from $x$ must
be less than or equal to that of $f^*$ with respect to the $D'_i$, given that $f'$ is chosen
to be optimal on the $D'_i$ values. We have:

$$\sum_{i=1}^{k} c_{i f'(x)} \, g_i \, D'_i(x) \leq \sum_{i=1}^{k} c_{i f^*(x)} \, g_i \, D'_i(x).$$

3 This result is essentially a generalization of Exercise 2.10 of Devroye et al's text- 
book |B], from 2 class to multiple classes, and in addition we show here that variable 
misclassification costs can be incorporated. This is the closest thing we have found 
to this Theorem that has already appeared, but we suspect that other related results 
may have appeared. We would welcome any further information or references on this 
topic. Theorem 2 is another result which we suspect may be known, but likewise we
have found no statement of it. 
Rearranging, we have

$$\sum_{i=1}^{k} D'_i(x) \, g_i \left(c_{i f^*(x)} - c_{i f'(x)}\right) \geq 0. \qquad (2)$$

From (1) and (2) it can be seen that

$$r_{f'(x) - f^*(x)}(x) \leq \sum_{i=1}^{k} \left(D_i(x) - D'_i(x)\right) g_i \left(c_{i f'(x)} - c_{i f^*(x)}\right) \leq \sum_{i=1}^{k} \left|D_i(x) - D'_i(x)\right| g_i \left|c_{i f'(x)} - c_{i f^*(x)}\right|.$$

Let $d_i(x)$ be the difference between the probability densities of $D_i$ and $D'_i$ at
$x \in X$, $d_i(x) = |D_i(x) - D'_i(x)|$. Therefore,

$$r_{f'(x) - f^*(x)}(x) \leq \sum_{i=1}^{k} \left|c_{i f'(x)} - c_{i f^*(x)}\right| g_i \, d_i(x) \leq \sum_{i=1}^{k} \max_j\{c_{ij}\} \, g_i \, d_i(x).$$

In order to bound the expected cost, it is necessary to sum over the range of
$x \in X$:

$$\sum_{x \in X} r_{f'(x) - f^*(x)}(x) \leq \sum_{x \in X} \sum_{i=1}^{k} \max_j\{c_{ij}\} \, g_i \, d_i(x) = \sum_{i=1}^{k} \max_j\{c_{ij}\} \, g_i \sum_{x \in X} d_i(x). \qquad (3)$$

Since $L_1(D_i, D'_i) \leq \epsilon / g_i$ for all $i$, i.e. $\sum_{x \in X} d_i(x) \leq \epsilon / g_i$, it follows from
(3) that

$$\sum_{x \in X} r_{f'(x) - f^*(x)}(x) \leq \sum_{i=1}^{k} \max_j\{c_{ij}\} \, g_i \left(\frac{\epsilon}{g_i}\right).$$

This expression gives an upper bound on expected cost for labelling $x$ as
$f'(x)$ instead of $f^*(x)$. By definition,

$$\sum_{x \in X} r_{f'(x) - f^*(x)}(x) = R(f') - R(f^*).$$

Therefore it has been shown that

$$R(f') \leq R(f^*) + \epsilon \sum_{i=1}^{k} \max_j\{c_{ij}\} \leq R(f^*) + \epsilon \cdot k \cdot \max_{ij}\{c_{ij}\}.$$

□
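The bound of Theorem 1 can be checked numerically on randomly generated instances. The sketch below is a sanity check under assumptions (finite domain, labels numbered from 0, randomly drawn costs in a hypothetical range), not a substitute for the proof. It takes $\epsilon$ to be the smallest value satisfying $L_1(D_i, D'_i) \leq \epsilon / g_i$ for every $i$.

```python
import random

def random_dist(rng, size):
    """A random probability vector of the given length."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_labels(g, dists, cost):
    """Label chosen at each point x by the Bayes classifier built from `dists`."""
    k, n = len(g), len(dists[0])

    def pointwise_cost(x, j):
        return sum(cost[i][j] * g[i] * dists[i][x] for i in range(k))

    return [min(range(k), key=lambda j: pointwise_cost(x, j)) for x in range(n)]

def risk(labels, g, true, cost):
    """Expected cost of a labelling, always measured on the true distributions."""
    return sum(cost[i][labels[x]] * g[i] * true[i][x]
               for x in range(len(labels)) for i in range(len(g)))

def excess_and_bound(seed, k=3, n=6):
    rng = random.Random(seed)
    g = random_dist(rng, k)
    true = [random_dist(rng, n) for _ in range(k)]
    est = [random_dist(rng, n) for _ in range(k)]
    cost = [[0.0 if i == j else rng.uniform(0.5, 2.0) for j in range(k)]
            for i in range(k)]
    # Smallest eps with L1(D_i, D'_i) <= eps / g_i for every label i.
    eps = max(g[i] * sum(abs(p - q) for p, q in zip(true[i], est[i]))
              for i in range(k))
    excess = (risk(bayes_labels(g, est, cost), g, true, cost)
              - risk(bayes_labels(g, true, cost), g, true, cost))
    bound = eps * k * max(max(row) for row in cost)
    return excess, bound
```

For every seed, the first returned value (the increase in risk) should lie between 0 and the second (the bound $\epsilon \cdot k \cdot \max_{ij}\{c_{ij}\}$).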

We next prove a corresponding result in terms of KL-divergence, which uses
the negative log-likelihood of the correct label as the cost function. We define
$\Pr_i(x)$ to be the probability that a data point at $x$ has label $i$, such that

$$\Pr_i(x) = g_i D_i(x) \left(\sum_{j=1}^{k} g_j D_j(x)\right)^{-1}.$$

Given a function $f : X \to \mathbb{R}^k$, where
$f(x)$ is a prediction of the probabilities of $x$ having each label $i \in \{1, \ldots, k\}$ (so
$\sum_{i=1}^{k} f_i(x) = 1$), the risk associated with $f$ can be expressed as

$$R(f) = \sum_{x \in X} D(x) \sum_{i=1}^{k} -\log(f_i(x)) \, \Pr_i(x). \qquad (4)$$

Let $f^* : X \to \mathbb{R}^k$ output the true class label distribution for an element of
$X$. From Equation (4) it can be seen that

$$R(f^*) = \sum_{x \in X} D(x) \sum_{i=1}^{k} -\log(\Pr_i(x)) \, \Pr_i(x). \qquad (5)$$

Theorem 2. For $f : X \to \mathbb{R}^k$ suppose that $R(f)$ is given by (4). If for each
label $i \in \{1, \ldots, k\}$, $I(D_i \| D'_i) \leq \epsilon / g_i$, then $R(f') \leq R(f^*) + k\epsilon$.

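Proof. Let $R_f(x)$ be the contribution at $x \in X$ to the risk associated with
classifier $f$, $R_f(x) = D(x) \sum_{i=1}^{k} -\log(f_i(x)) \, \Pr_i(x)$. Therefore $R(f') = \sum_{x \in X} R_{f'}(x)$.
We define $\Pr'_i(x)$ to be the estimated probability that a data point at $x \in X$
has label $i \in \{1, \ldots, k\}$, from distributions $D'_i$, such that
$\Pr'_i(x) = g_i D'_i(x) \left(\sum_{j=1}^{k} g_j D'_j(x)\right)^{-1}$. It follows that

$$R_{f'}(x) = D(x) \sum_{i=1}^{k} -\log(\Pr'_i(x)) \, \Pr_i(x).$$

Let $\xi(x)$ denote the contribution to additional risk incurred from using $f'$ as
opposed to $f^*$ at $x \in X$. From (5) it can be seen that

$$\xi(x) = R_{f'}(x) - D(x) \sum_{i=1}^{k} -\log(\Pr_i(x)) \, \Pr_i(x)$$

$$= D(x) \sum_{i=1}^{k} \Pr_i(x) \left(\log(\Pr_i(x)) - \log(\Pr'_i(x))\right)$$

$$= D(x) \sum_{i=1}^{k} \left(\frac{g_i D_i(x)}{\sum_{j=1}^{k} g_j D_j(x)}\right) \left(\log\left(\frac{g_i D_i(x)}{\sum_{j=1}^{k} g_j D_j(x)}\right) - \log\left(\frac{g_i D'_i(x)}{\sum_{j=1}^{k} g_j D'_j(x)}\right)\right).$$

We define $D'$ such that $D'(x) = \sum_{i=1}^{k} g_i D'_i(x)$. Since it is the case that
$D(x) = \sum_{i=1}^{k} g_i D_i(x)$, $\xi(x)$ can be rewritten as

$$\xi(x) = D(x) \sum_{i=1}^{k} \left(\frac{g_i D_i(x)}{D(x)}\right) \left(\log\left(\frac{D_i(x)}{D'_i(x)}\right) - \log\left(\frac{D(x)}{D'(x)}\right)\right)$$

$$= \sum_{i=1}^{k} \left(g_i D_i(x) \log\left(\frac{D_i(x)}{D'_i(x)}\right)\right) - D(x) \log\left(\frac{D(x)}{D'(x)}\right).$$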
We define $I(D \| D')(x)$ to be the contribution at $x \in X$ to the KL-divergence,
such that $I(D \| D')(x) = D(x) \log(D(x) / D'(x))$. It follows that

$$\sum_{x \in X} \xi(x) = \sum_{i=1}^{k} g_i \, I(D_i \| D'_i) - I(D \| D'). \qquad (6)$$

We know that the KL-divergence between $D_i$ and $D'_i$ is bounded by $\epsilon / g_i$ for
each label $i \in \{1, \ldots, k\}$, so (6) can be rewritten as

$$\sum_{x \in X} \xi(x) \leq \sum_{i=1}^{k} g_i \left(\frac{\epsilon}{g_i}\right) - I(D \| D') = k\epsilon - I(D \| D').$$

Due to the fact that the KL-divergence between two distributions is non-negative,
an upper bound on the cost can be obtained by letting $I(D \| D') = 0$,
so $R(f') - R(f^*) \leq k\epsilon$. Therefore it has been proved that $R(f') \leq R(f^*) + k\epsilon$. □
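The decomposition in (6) and the resulting bound of Theorem 2 can be checked numerically. The following sketch assumes a finite domain, strictly positive probabilities, natural logarithms and labels numbered from 0; all function names are hypothetical.

```python
import math
import random

def random_dist(rng, size):
    """A random probability vector with strictly positive entries."""
    weights = [rng.random() + 0.05 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL-divergence I(p || q) in nats for strictly positive vectors."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def mixture(g, dists):
    """Pointwise mixture sum_i g_i * D_i(x)."""
    return [sum(g[i] * dists[i][x] for i in range(len(g)))
            for x in range(len(dists[0]))]

def log_loss_risk(g, true, model):
    """Risk (4) of the posterior predictor built from `model`, measured on `true`."""
    mix = mixture(g, model)
    total = 0.0
    for x in range(len(mix)):
        for i in range(len(g)):
            # D(x) * Pr_i(x) equals g_i * D_i(x) under the true distributions.
            total -= g[i] * true[i][x] * math.log(g[i] * model[i][x] / mix[x])
    return total

def excess_and_decomposition(seed, k=3, n=5):
    rng = random.Random(seed)
    g = random_dist(rng, k)
    true = [random_dist(rng, n) for _ in range(k)]
    est = [random_dist(rng, n) for _ in range(k)]
    excess = log_loss_risk(g, true, est) - log_loss_risk(g, true, true)
    weighted_kl = sum(g[i] * kl(true[i], est[i]) for i in range(k))
    decomposition = weighted_kl - kl(mixture(g, true), mixture(g, est))
    # Smallest eps with I(D_i || D'_i) <= eps / g_i for every label i.
    eps = max(g[i] * kl(true[i], est[i]) for i in range(k))
    return excess, decomposition, k * eps
```

The first two returned values should agree up to rounding, and both should be at most the third.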

2.2 Lower Bounds 

In this section we give lower bounds corresponding to the two upper bounds
given in Section 2.1.

Example 1. Consider a distribution $D$ over domain $X = \{x_0, x_1\}$, from which
data is generated with labels 0 and 1 and there is an equal probability of each
label being generated ($g_0 = g_1 = \frac{1}{2}$). $D_i(x)$ denotes the probability that a point
is generated at $x \in X$ given that it has label $i$. $D_0$ and $D_1$ are distributions over
$X$, such that at $x \in X$, $D(x) = \frac{1}{2}(D_0(x) + D_1(x))$.

Suppose that $D'_0$ and $D'_1$ are approximations of $D_0$ and $D_1$, and that
$L_1(D_0, D'_0) = \epsilon / g_0 = 2\epsilon$ and $L_1(D_1, D'_1) = \epsilon / g_1 = 2\epsilon$, where $\epsilon = \epsilon' + \gamma$ (and $\gamma$ is an
arbitrarily small constant).

Given the following distributions, assuming that a misclassification results in
a cost of 1 and that a correct classification results in no cost, it can be seen that
$R(f^*) = \frac{1}{2} - \epsilon'$:

$$D_0(x_0) = \tfrac{1}{2} + \epsilon', \quad D_0(x_1) = \tfrac{1}{2} - \epsilon',$$

$$D_1(x_0) = \tfrac{1}{2} - \epsilon', \quad D_1(x_1) = \tfrac{1}{2} + \epsilon'.$$

Now if we have approximations $D'_0$ and $D'_1$ as shown below, it can be seen
that $f'$ will misclassify for every value of $x \in X$:

$$D'_0(x_0) = \tfrac{1}{2} - \gamma, \quad D'_0(x_1) = \tfrac{1}{2} + \gamma,$$

$$D'_1(x_0) = \tfrac{1}{2} + \gamma, \quad D'_1(x_1) = \tfrac{1}{2} - \gamma.$$

This results in $R(f') = \frac{1}{2} + \epsilon'$. Therefore $R(f') = R(f^*) + 2\epsilon' = R(f^*) + 2(\epsilon - \gamma)$.
In this example the risk is only $2\gamma$ under $R(f^*) + \epsilon \cdot k \cdot \max_{ij}\{c_{ij}\}$, since $k = 2$.
A similar example can be used to give a lower bound corresponding to the upper
bound given in Theorem 2.
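The arithmetic of Example 1 can be reproduced with concrete numbers. The values $\epsilon' = 0.1$ and $\gamma = 0.01$ below are arbitrary choices for illustration, and labels and points are indexed from 0.

```python
def zero_one_risk(decide_with, g, true):
    """0/1-loss risk of the Bayes rule built from `decide_with`, measured on `true`."""
    k, n = len(g), len(true[0])
    total = 0.0
    for x in range(n):
        # With 0/1 costs the Bayes rule picks the label maximising g_j * D_j(x).
        label = max(range(k), key=lambda j: g[j] * decide_with[j][x])
        total += sum(g[i] * true[i][x] for i in range(k) if i != label)
    return total

eps_prime, gamma = 0.1, 0.01  # illustrative values only
g = [0.5, 0.5]
D = [[0.5 + eps_prime, 0.5 - eps_prime], [0.5 - eps_prime, 0.5 + eps_prime]]
D_est = [[0.5 - gamma, 0.5 + gamma], [0.5 + gamma, 0.5 - gamma]]

risk_opt = zero_one_risk(D, g, D)      # R(f*) = 1/2 - eps'
risk_est = zero_one_risk(D_est, g, D)  # R(f') = 1/2 + eps'
eps = max(g[i] * sum(abs(p - q) for p, q in zip(D[i], D_est[i])) for i in range(2))
upper_bound = risk_opt + eps * 2 * 1.0  # R(f*) + eps * k * max c_ij
```

Here `upper_bound - risk_est` equals $2\gamma$, so the bound of Theorem 1 is approached as $\gamma$ tends to 0.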

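Example 2. Consider distributions $D_0$, $D_1$, $D'_0$ and $D'_1$ over domain $X = \{x_0, x_1\}$
as defined in Example 1. It can be seen that the KL-divergence between each
label's distribution and its approximated distribution is

$$I(D_0 \| D'_0) = I(D_1 \| D'_1) = \left(\tfrac{1}{2} + \epsilon'\right) \log\left(\frac{\frac{1}{2} + \epsilon'}{\frac{1}{2} - \gamma}\right) + \left(\tfrac{1}{2} - \epsilon'\right) \log\left(\frac{\frac{1}{2} - \epsilon'}{\frac{1}{2} + \gamma}\right).$$

The optimal risk, measured in terms of negative log-likelihood, can be expressed
as $R(f^*) = -\left(\frac{1}{2} + \epsilon'\right) \log\left(\frac{1}{2} + \epsilon'\right) - \left(\frac{1}{2} - \epsilon'\right) \log\left(\frac{1}{2} - \epsilon'\right)$. The risk
incurred by using $f'$ as the discriminant function is
$R(f') = -\left(\frac{1}{2} + \epsilon'\right) \log\left(\frac{1}{2} - \gamma\right) - \left(\frac{1}{2} - \epsilon'\right) \log\left(\frac{1}{2} + \gamma\right)$. Therefore,

$$R(f') = R(f^*) + \left(\tfrac{1}{2} + \epsilon'\right) \log\left(\frac{\frac{1}{2} + \epsilon'}{\frac{1}{2} - \gamma}\right) + \left(\tfrac{1}{2} - \epsilon'\right) \log\left(\frac{\frac{1}{2} - \epsilon'}{\frac{1}{2} + \gamma}\right) = R(f^*) + I(D_0 \| D'_0).$$

Taking $\epsilon = g_i \, I(D_i \| D'_i)$, so that the condition of Theorem 2 holds with equality,
the increase in risk is exactly $k\epsilon$ with $k = 2$.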
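The quantities in Example 2 can be evaluated directly. The sketch below uses the illustrative values $\epsilon' = 0.1$ and $\gamma = 0.01$ and natural logarithms; these choices and the variable names are assumptions made here.

```python
import math

eps_prime, gamma = 0.1, 0.01  # illustrative values only

# In this example D(x) = 1/2 at both points, so the true posteriors are
# 1/2 + eps' and 1/2 - eps', and the estimated ones are 1/2 - gamma and 1/2 + gamma.
p_hi, p_lo = 0.5 + eps_prime, 0.5 - eps_prime
q_lo, q_hi = 0.5 - gamma, 0.5 + gamma

kl_per_class = p_hi * math.log(p_hi / q_lo) + p_lo * math.log(p_lo / q_hi)
risk_opt = -p_hi * math.log(p_hi) - p_lo * math.log(p_lo)
risk_est = -p_hi * math.log(q_lo) - p_lo * math.log(q_hi)
excess = risk_est - risk_opt
```

The excess risk coincides with the per-class KL-divergence, which is the sense in which the bound of Theorem 2 is met in this example.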

2.3 Learning near-optimal classifiers in the PAC sense 

We show that the results of Section 2.1 imply learnability within the framework
defined in Section 1.2.

The following corollaries refer to algorithms $A_{class}$ and $A_{class'}$. These algorithms
generate classifier functions $f' : X \to \{1, 2, \ldots, k\}$, which label data in a
$k$-label classification problem, using $L_1$ distance and KL-divergence respectively
as measurements of accuracy.

Corollary 1 shows (using Theorem 1) that a near optimal classifier can be
constructed given that an algorithm exists which approximates a distribution
over positive data in polynomial time. We are given cost matrix $C$, and assume
knowledge of the class priors $g_i$.

Corollary 1. If an algorithm $A_{L_1}$ approximates distributions within $L_1$ distance
$\epsilon'$ with probability at least $1 - \delta'$, in time polynomial in $1/\epsilon'$ and $1/\delta'$, then an
algorithm $A_{class}$ exists which (with probability $1 - \delta$) generates a discriminant
function $f'$ with an associated risk of at most $R(f^*) + \epsilon$, and $A_{class}$ is polynomial
in $1/\delta$ and $1/\epsilon$.

Proof. $A_{class}$ is a classification algorithm which uses unsupervised learners to fit
a distribution to each label $i \in \{1, \ldots, k\}$, and then uses the Bayes classifier with
respect to these estimated distributions, to label data.

$A_{L_1}$ is a PAC algorithm which learns from a sample of positive data to
estimate a distribution over that data. $A_{class}$ generates a sample $N$ of data, and
divides $N$ into sets $\{N_1, \ldots, N_k\}$, such that $N_i$ contains all members of $N$ with
label $i$. Note that for all labels $i$, $|N_i| \approx g_i \cdot |N|$.

With a probability of at least $1 - \frac{1}{2}(\delta / k)$, $A_{L_1}$ generates an estimate $D'$ of
the distribution $D_i$ over label $i$, such that $L_1(D_i, D') \leq \epsilon \left(g_i \cdot k \cdot \max_{ij}\{c_{ij}\}\right)^{-1}$.
Therefore the size of the sample $|N_i|$ must be polynomial in $g_i \cdot k \cdot \max_{ij}\{c_{ij}\} / \epsilon$
and $k / \delta$. For all $i \in \{1, \ldots, k\}$, $g_i \leq 1$, so $|N_i|$ is polynomial in $\max_{ij}\{c_{ij}\}$, $k$, $1/\epsilon$
and $1/\delta$.

When $A_{class}$ combines the distributions returned by the $k$ iterations of $A_{L_1}$,
there is a probability of at least $1 - \delta/2$ that all of the distributions are within
$\epsilon \left(g_i \cdot k \cdot \max_{ij}\{c_{ij}\}\right)^{-1}$ $L_1$ distance of the true distributions (given that each
iteration received a sufficiently large sample). We allow a probability of $\delta/2$ that
the initial sample $N$ did not contain a good representation of all labels
($\neg \forall i \in \{1, \ldots, k\} : |N_i| \approx g_i \cdot |N|$), and as such one or more iterations of $A_{L_1}$ may not
have received a sufficiently large sample to learn the distribution accurately.

Therefore with probability at least $1 - \delta$, all approximated distributions are
within $\epsilon \left(g_i \cdot k \cdot \max_{ij}\{c_{ij}\}\right)^{-1}$ $L_1$ distance of the true distributions. If we use the
classifier which is optimal on these approximated distributions, $f'$, then the
increase in risk associated with using $f'$ instead of the Bayes optimal classifier,
$f^*$, is at most $\epsilon$. It has been shown that $A_{L_1}$ requires a sample of size polynomial
in $1/\epsilon$, $1/\delta$, $k$ and $\max_{ij}\{c_{ij}\}$. It follows that $A_{class}$, which makes $k$ calls to $A_{L_1}$,
is polynomial in the same parameters.

□
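The construction of $A_{class}$ in the proof can be sketched as below. This is an illustration under stated assumptions: `empirical_estimator` is a hypothetical stand-in for $A_{L_1}$ (it simply returns empirical frequencies and carries no PAC guarantee by itself), labels are numbered from 0, and the class priors and cost matrix are supplied by the caller as the corollary assumes.

```python
from collections import Counter, defaultdict

def empirical_estimator(points):
    """Hypothetical stand-in for A_L1: empirical frequencies of the observed points."""
    counts = Counter(points)
    n = len(points)
    return {x: c / n for x, c in counts.items()}

def learn_classifier(sample, priors, cost, estimator=empirical_estimator):
    """sample: list of (x, label); priors[i] = g_i; cost[i][j]: true i, predicted j."""
    by_label = defaultdict(list)
    for x, label in sample:
        by_label[label].append(x)  # N_i: the members of N with label i
    estimates = {i: estimator(points) for i, points in by_label.items()}
    labels = sorted(estimates)

    def classify(x):
        # Bayes classifier with respect to the estimated distributions.
        return min(labels, key=lambda j: sum(
            cost[i][j] * priors[i] * estimates[i].get(x, 0.0) for i in labels))

    return classify

# Hypothetical labeled sample over the domain {"a", "b"}.
sample = ([("a", 0)] * 8 + [("b", 0)] * 2 + [("a", 1)] * 3 + [("b", 1)] * 7)
classify = learn_classifier(sample, priors=[0.5, 0.5], cost=[[0, 1], [1, 0]])
```

Any estimator with the same interface can be passed in place of `empirical_estimator`; the guarantee of Corollary 1 applies only when that estimator meets the stated $L_1$ accuracy with the stated confidence.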

Corollary 2 shows (using Theorem 2) how a near optimal classifier can be
constructed given that an algorithm exists which approximates a distribution
over positive data in polynomial time.

Corollary 2. If an algorithm $A_{KL}$ has a probability of at least $1 - \delta$ of approximating
distributions within $\epsilon$ KL-divergence, in time polynomial in $1/\epsilon$ and $1/\delta$,
then an algorithm $A_{class'}$ exists which (with probability $1 - \delta$) generates a function
$f'$ that maps $x \in X$ to a conditional distribution over class labels of $x$, with
an associated log-likelihood risk of at most $R(f^*) + \epsilon$, and $A_{class'}$ is polynomial
in $1/\delta$ and $1/\epsilon$.

Proof. $A_{class'}$ is a classification algorithm using the same method as $A_{class}$ in
Corollary 1, whereby a sample $N$ is divided into sets $\{N_1, \ldots, N_k\}$, and each set
is passed to algorithm $A_{KL}$ where a distribution is estimated over the data in
the set.

With a probability of at least $1 - \frac{1}{2}(\delta / k)$, $A_{KL}$ generates an estimate $D'$ of
the distribution $D_i$ over label $i$, such that $I(D_i \| D') \leq \epsilon (g_i \cdot k)^{-1}$. Therefore the
size of the sample $|N_i|$ must be polynomial in $g_i \cdot k / \epsilon$ and $k / \delta$. Since $g_i \leq 1$, $|N_i|$
is polynomial in $k / \epsilon$ and $k / \delta$.

When $A_{class'}$ combines the distributions returned by the $k$ iterations of $A_{KL}$,
there is a probability of at least $1 - \delta/2$ that all of the distributions are within
$\epsilon (g_i \cdot k)^{-1}$ KL-divergence of the true distributions. We allow a probability of $\delta/2$
that the initial sample $N$ did not contain a good representation of all labels
($\neg \forall i \in \{1, \ldots, k\} : |N_i| \approx g_i \cdot |N|$).

Therefore with probability at least $1 - \delta$, all approximated distributions are
within $\epsilon (g_i \cdot k)^{-1}$ KL-divergence of the true distributions. If we use the classifier
which is optimal on these approximated distributions, $f'$, then the increase in
risk associated with using $f'$ instead of the Bayes optimal classifier $f^*$ is at
most $\epsilon$. It has been shown that $A_{KL}$ requires a sample of size polynomial in $1/\epsilon$,
$1/\delta$ and $k$. Let $p(1/\epsilon, 1/\delta)$ be an upper bound on the time and sample size used
by $A_{KL}$. It follows that

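$$\sum_{i=1}^{k} p\left(\frac{g_i k}{\epsilon}, \frac{2k}{\delta}\right) \in O\left(k \cdot p\left(\frac{k}{\epsilon}, \frac{2k}{\delta}\right)\right).$$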

□ 

2.4 Smoothing: from $L_1$ distance to KL-divergence

Given a distribution that has accuracy $\epsilon$ under the $L_1$ distance, is there a generic
way to "smooth" it so that it has similar accuracy under the KL-divergence?
From [5] this can be done for $X = \{0, 1\}^n$, if we are interested in algorithms
that are polynomial in $n$ in addition to other parameters. Suppose however that
the domain is bit strings of unlimited length. Here we give a related but weaker
result in terms of bit strings that are used to represent distributions, as opposed
to members of the domain. We define class $\mathcal{D}$ of distributions specified by bit
strings, such that each member of $\mathcal{D}$ is a distribution on discrete domain $X$,
represented by a discrete probability scale. Let $L_D$ be the length of the bit
string describing distribution $D$. Note that there are at most $2^{L_D}$ distributions
in $\mathcal{D}$ represented by strings of length $L_D$.

Lemma 1. Suppose $D \in \mathcal{D}$ is learnable under $L_1$ distance in time polynomial in
$1/\delta$, $1/\epsilon$ and $L_D$. Then $\mathcal{D}$ is learnable under KL-divergence, with polynomial sample
size.

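Proof. Let $D$ be a member of class $\mathcal{D}$, represented by a bit string of length
$L_D$, and let algorithm $A$ be an algorithm which takes an input set $S$ (where
$|S|$ is polynomial in $1/\epsilon$, $1/\delta$ and $L_D$) of samples generated i.i.d. from distribution
$D$, and with probability at least $1 - \delta$ returns a distribution $D_{L_1}$, such that
$L_1(D, D_{L_1}) \leq \epsilon$.

Let $\xi = c \, (\epsilon^2 / L_D)$ for a suitably small constant $c > 0$. We define algorithm $A'$
such that with probability at
least $1 - \delta$, $A'$ returns distribution $D'_{L_1}$, where $L_1(D, D'_{L_1}) \leq \xi$. Algorithm $A'$
runs $A$ with sample $S'$, where $|S'|$ is polynomial in $1/\xi$, $1/\delta$ and $L_D$ (and it should
be noted that $|S'|$ is therefore polynomial in $1/\epsilon$, $1/\delta$ and $L_D$).

We define $D_{L_D}$ to be the unweighted mixture of all distributions in $\mathcal{D}$
represented by length $L_D$ bit strings, $D_{L_D}(x) = 2^{-L_D} \sum_{D \in \mathcal{D}} D(x)$. We now define
distribution $D'_{KL}$ such that $D'_{KL}(x) = (1 - \xi) D'_{L_1}(x) + \xi \cdot D_{L_D}(x)$.

By the definition of $D'_{KL}$, $L_1(D'_{L_1}, D'_{KL}) \leq 2\xi$. With probability at least $1 - \delta$,
$L_1(D, D'_{L_1}) \leq \xi$, and therefore with probability at least $1 - \delta$, $L_1(D, D'_{KL}) \leq 3\xi$.

We define $X_< = \{x \in X \mid D'_{KL}(x) < D(x)\}$. Members of $X_<$ contribute
positively to $I(D \| D'_{KL})$. Therefore

$$I(D \| D'_{KL}) \leq \sum_{x \in X_<} D(x) \log\left(\frac{D(x)}{D'_{KL}(x)}\right)$$

$$= \sum_{x \in X_<} \left(D(x) - D'_{KL}(x)\right) \log\left(\frac{D(x)}{D'_{KL}(x)}\right) + \sum_{x \in X_<} D'_{KL}(x) \log\left(\frac{D(x)}{D'_{KL}(x)}\right). \qquad (7)$$

We have shown that $L_1(D, D'_{KL}) \leq 3\xi$, so $\sum_{x \in X_<} (D(x) - D'_{KL}(x)) \leq 3\xi$.
Analysing the first term in (7),

$$\sum_{x \in X_<} \left(D(x) - D'_{KL}(x)\right) \log\left(\frac{D(x)}{D'_{KL}(x)}\right) \leq 3\xi \max_{x \in X_<} \log\left(\frac{D(x)}{D'_{KL}(x)}\right).$$

Note that for all $x \in X$, $D'_{KL}(x) \geq \xi \cdot 2^{-L_D} \cdot D(x)$, because $D$ is itself one of the
components of the mixture $D_{L_D}$. It follows that

$$\max_{x \in X_<} \log\left(\frac{D(x)}{D'_{KL}(x)}\right) \leq \log\left(2^{L_D} / \xi\right) = L_D - \log(\xi).$$

Examining the second term in (7),

$$\sum_{x \in X_<} D'_{KL}(x) \log\left(\frac{D(x)}{D'_{KL}(x)}\right) = \sum_{x \in X_<} D'_{KL}(x) \log\left(\frac{D'_{KL}(x) + h_x}{D'_{KL}(x)}\right),$$

where $h_x = D(x) - D'_{KL}(x)$, which is a positive quantity for all $x \in X_<$. Due to
the concavity of the logarithm function, it follows that

$$\sum_{x \in X_<} D'_{KL}(x) \left(\log(D'_{KL}(x) + h_x) - \log(D'_{KL}(x))\right) \leq \sum_{x \in X_<} D'_{KL}(x) \, h_x \left[\frac{d}{dy} \log(y)\right]_{y = D'_{KL}(x)} = \sum_{x \in X_<} h_x \leq 3\xi.$$

Therefore, $I(D \| D'_{KL}) \leq 3\xi (1 + L_D - \log(\xi))$. For values of $\xi \leq c \, (\epsilon^2 / L_D)$,
it can be seen that $I(D \| D'_{KL}) \leq \epsilon$. □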
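The smoothing step in the proof can be illustrated on a toy class of distributions. Everything concrete below is an assumption made for the example: a family of four distributions over a three-point domain (so $L_D = 2$ bits), the value $\xi = 0.02$, and natural logarithms, for which the bound reads $3\xi(1 + L_D \ln 2 - \ln \xi)$.

```python
import math

def kl(p, q):
    """KL-divergence I(p || q) in nats; infinite if q is zero where p is positive."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def smooth(estimate, family, xi):
    """Mix `estimate` with the unweighted mixture of `family`, giving the latter weight xi."""
    n = len(estimate)
    uniform_mix = [sum(d[x] for d in family) / len(family) for x in range(n)]
    return [(1 - xi) * e + xi * m for e, m in zip(estimate, uniform_mix)]

# Hypothetical class of 2^2 = 4 distributions, each described by L_D = 2 bits.
family = [[0.5, 0.49, 0.01], [1 / 3, 1 / 3, 1 / 3], [0.2, 0.3, 0.5], [0.7, 0.2, 0.1]]
target = family[0]
L_D = 2
xi = 0.02
# An estimate within L1 distance xi of the target (up to rounding) that
# assigns zero probability to a point the target can generate.
estimate = [0.505, 0.495, 0.0]

kl_before = kl(target, estimate)
smoothed = smooth(estimate, family, xi)
kl_after = kl(target, smoothed)
bound = 3 * xi * (1 + L_D * math.log(2) - math.log(xi))
```

Before smoothing the KL-divergence is infinite even though the $L_1$ distance is small; after smoothing it is finite and below the bound. The mixture here has only four components, whereas in the lemma it ranges over exponentially many distributions, which is why the method is described as inefficient.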

Corollary 3. Consider the problem of learning PDFAs having $n$ states, over
alphabet $\Sigma$, and probabilities represented by bit strings of length $\ell$. Using sample
size (but not time) polynomial in $n$, $|\Sigma|$ and $\ell$ (and the PAC parameters $\epsilon$ and
$\delta$), a distribution in this class can be estimated within KL distance $\epsilon$.

The proof follows from the observation that such a PDFA can be represented 
using a bit string whose length is polynomial in the parameters. 

Consequently we can learn the same class of PDFAs under the KL-divergence 
that can be learned under the $L_1$ distance in JS], i.e. PDFAs with distinguishable
states but no restriction on the expected length of their outputs. However, note 
that the hypothesis is "inefficient" (a mixture of exponentially many PDFAs). 
3 Conclusion

We have shown a close relationship between the error of an estimated input
distribution (as measured by $L_1$ distance or KL-divergence) and the error rate
of the resulting classifier. In situations where we believe that input distributions
may be accurately estimated, the resulting information about the data may be
more useful than just a near-optimal classifier.

A general issue of interest is the question of when one can obtain good
classifiers from estimated distributions that satisfy weaker goodness-of-approximation
criteria than those considered here. Suppose for example that elements of a
2-element domain $\{x_1, x_2\}$ are being labeled by the stochastic rule that assigns
labels 0 and 1 to either element of the domain, with equal probability. Then any
classifier does no better than random labeling, and so we can use arbitrary
distributions $D'_0$ and $D'_1$ as estimates of the distributions $D_0$ and $D_1$ over examples
with label 0 and 1 respectively. In [H] we show that in the basic PAC framework
we can sometimes design discriminant functions based on unlabeled data sets,
that result in PAC classifiers without any guarantee on how well-estimated is
the input distribution. Further work should possibly compromise between the
distribution-free setting and the objective, considered here, of approximating
the input distributions in a strong sense.